The violent crime rate in U.S increased by 3.4 percent nationwide in 2016 in US. As an international student, as well as a New Yorker, the public safety in NYC is always a concern to us, especially after the recent terrorists attack near the World Trade Center. Thus, our group decided to make a deeper investigation of the crime data and seek out some underlying reasons which led to the increase of crime rate.
The New York City Police Department provides overall crime dataset. NYPD also established a CompStat model, called CompStat 2.0, providing greater specificity about crimes through an online interactive experience.\ On the official website of new york city, there is also a Crime Map which enables people to view crime by precinct. This map includes crimes of seven major felonies.
Since the dataset has 341716, 9, we randomly sample 50000 observations and creat an interactive map showing locations where the crimes in New York City occured:
sample <- nyc_crime_2017[sample(1:nrow(nyc_crime_2017), 50000,replace=FALSE),]
sample %>%
mutate(text_label = str_c("Offense desc:", ofns_desc, ' Boro: ', boro)) %>%
plot_ly(x = ~longitude, y = ~latitude, type = "scatter", mode = "markers",
alpha = 0.5,
color = ~ofns_type,
text = ~text_label)
*Collect historic data of crimes
nyc_hist_vio = read_excel("./historic/violation-offenses-2000-2016.xls", range = "A4:R6") %>%
mutate(ofns_type = "VIOLATION")
nyc_hist_felony_7 = read_excel("./historic/seven-major-felony-offenses-2000-2016.xls", range = "A5:R12") %>%
mutate(ofns_type = "FELONY")
nyc_hist_felony = read_excel("./historic/non-seven-major-felony-offenses-2000-2016.xls", range = "A5:R13") %>%
mutate(ofns_type = "FELONY")
nyc_hist_mis = read_excel("./historic/misdemeanor-offenses-2000-2016.xls", range = "A4:R21")%>%
mutate(ofns_type = "MISDEMEANOR")
We combine the information of crimes in past 16 years.
nyc_crime_hist = nyc_hist_mis %>%
full_join(nyc_hist_felony) %>%
full_join(nyc_hist_felony_7) %>%
full_join(nyc_hist_vio) %>%
mutate(ofns_type = as.factor(ofns_type), ofns_desc = OFFENSE) %>%
select(-OFFENSE)
nyc_crime_hist = nyc_crime_hist %>%
gather(key = year, value = count, "2000":"2016") %>%
group_by(year, ofns_type) %>%
summarize(n = sum(count)/12) %>%
full_join(nyc_crime_2017 %>%
group_by(ofns_type) %>%
summarize(n = n()/10) %>%
mutate(year = "2017")) %>%
ungroup()
nyc_crime_hist %>%
mutate(year = as.numeric(year)) %>%
ggplot(aes(x = year, y = n, fill = ofns_type)) + geom_bar(stat = "identity")
Bar chart showing crime number and offense type in different boros:
barplot = nyc_crime_2017 %>%
mutate(boro = fct_infreq(boro)) %>%
ggplot(aes(x = boro, fill = ofns_type)) + geom_bar()
ggplotly(barplot)
The graph reveals the number of crime vs boro. Among 5 boros, Brooklyn has the highest number of crime in the frist 10 months in 2017. Offense type includes felony, misdemeanor and violation. Misdemeanor is the most frequent offense types across these 5 boros.
The count of different type of crimes based on 2017 data
crime_tidy2 = nyc_crime_2017 %>%
group_by(date, ofns_type) %>%
summarize(crime_count = n())
ggplot(crime_tidy2, aes(x = date, y = crime_count, color = ofns_type)) +
geom_point(alpha = .6) + geom_smooth() +
theme(legend.position = "bottom")
We then focused on crime data of current year.
Make a plot of crime count versus hour in a day and group by boro.
nyc_crime_2017 %>%
mutate(hour = hour(time)) %>%
group_by(hour, boro) %>%
summarize(n = n()) %>%
ggplot(aes(x = hour, y = n, color = boro)) + geom_point(alpha = 0.5) + geom_line()
Make a crime rate plot based on 2017 data
crime_tidy = nyc_crime_2017 %>%
separate(date, into = c("year", "month","day"), sep = "-") %>%
select(-year, -day) %>%
group_by(month,boro) %>%
summarize(crime_count = n())
crimetotal = ggplot(crime_tidy, aes(x = month, y = crime_count, color = boro)) +
geom_point() + geom_path(aes(group = boro)) +
theme(legend.position = "bottom")
crime_rate = crime_tidy %>%
mutate(popluation = recode(boro, "BRONX" = 1455720,
"BROOKLYN" = 2629150,
"MANHATTAN" = 1643734,
"QUEENS" = 2333054,
"STATEN ISLAND" = 476015)) %>%
mutate(crime_rate = (crime_count/popluation)*100000)
crimerate = ggplot(crime_rate, aes(x = month, y = crime_rate, color = boro)) +
geom_point() + geom_path(aes(group = boro)) +
theme(legend.position = "bottom")
library(gridExtra)
grid.arrange(crimetotal, crimerate, ncol = 2)
crime_type = crime_2017 %>%
mutate(prem_typ=as.character(prem_typ),
boro=as.character(boro),
ofns_type=as.character(ofns_type),
ofns_desc=as.character(ofns_desc))
crime_type$prem_typ[grep("RESIDENCE",crime_type$prem_typ)] = "RESIDENCE"
place_offense = function(x){
crime_boro = crime_type %>%
filter(boro == x) %>%
group_by(prem_typ) %>%
summarize(crime = n()) %>%
arrange(desc(crime)) %>%
top_n(10)
top_place = crime_boro$prem_typ
mat <- matrix(ncol = 3, nrow = 4)
for (i in 1:4) {
offense_select = crime_type %>%
filter(prem_typ == top_place[i],
boro == x) %>%
group_by(ofns_desc) %>%
summarize(crime = n()) %>%
arrange(desc(crime)) %>%
top_n(3)
top_offense = offense_select$ofns_desc
mat[i,] <- top_offense
}
return(mat)
}
place_offense("QUEENS")
## [,1] [,2]
## [1,] "HARRASSMENT 2" "ASSAULT 3 & RELATED OFFENSES"
## [2,] "CRIMINAL MISCHIEF & RELATED OF" "PETIT LARCENY"
## [3,] "PETIT LARCENY" "HARRASSMENT 2"
## [4,] "CRIMINAL MISCHIEF & RELATED OF" "HARRASSMENT 2"
## [,3]
## [1,] "CRIMINAL MISCHIEF & RELATED OF"
## [2,] "ASSAULT 3 & RELATED OFFENSES"
## [3,] "CRIMINAL MISCHIEF & RELATED OF"
## [4,] "PETIT LARCENY"
place_offense("BROOKLYN")
## [,1] [,2]
## [1,] "HARRASSMENT 2" "ASSAULT 3 & RELATED OFFENSES"
## [2,] "CRIMINAL MISCHIEF & RELATED OF" "PETIT LARCENY"
## [3,] "CRIMINAL MISCHIEF & RELATED OF" "PETIT LARCENY"
## [4,] "PETIT LARCENY" "GRAND LARCENY"
## [,3]
## [1,] "PETIT LARCENY"
## [2,] "ASSAULT 3 & RELATED OFFENSES"
## [3,] "HARRASSMENT 2"
## [4,] "HARRASSMENT 2"
place_offense("STATEN ISLAND")
## [,1] [,2]
## [1,] "HARRASSMENT 2" "ASSAULT 3 & RELATED OFFENSES"
## [2,] "CRIMINAL MISCHIEF & RELATED OF" "INTOXICATED & IMPAIRED DRIVING"
## [3,] "CRIMINAL MISCHIEF & RELATED OF" "HARRASSMENT 2"
## [4,] "PETIT LARCENY" "GRAND LARCENY"
## [,3]
## [1,] "CRIMINAL MISCHIEF & RELATED OF"
## [2,] "HARRASSMENT 2"
## [3,] "PETIT LARCENY"
## [4,] "HARRASSMENT 2"
place_offense("BRONX")
## [,1] [,2]
## [1,] "HARRASSMENT 2" "ASSAULT 3 & RELATED OFFENSES"
## [2,] "PETIT LARCENY" "CRIMINAL MISCHIEF & RELATED OF"
## [3,] "OFFENSES AGAINST PUBLIC ADMINI" "FELONY ASSAULT"
## [4,] "PETIT LARCENY" "GRAND LARCENY"
## [,3]
## [1,] "OFF. AGNST PUB ORD SENSBLTY &"
## [2,] "DANGEROUS DRUGS"
## [3,] "PETIT LARCENY"
## [4,] "HARRASSMENT 2"
place_offense("MANHATTAN")
## [,1] [,2]
## [1,] "HARRASSMENT 2" "ASSAULT 3 & RELATED OFFENSES"
## [2,] "PETIT LARCENY" "CRIMINAL MISCHIEF & RELATED OF"
## [3,] "FORGERY" "GRAND LARCENY"
## [4,] "PETIT LARCENY" "GRAND LARCENY"
## [,3]
## [1,] "GRAND LARCENY"
## [2,] "ASSAULT 3 & RELATED OFFENSES"
## [3,] "CRIMINAL MISCHIEF & RELATED OF"
## [4,] "HARRASSMENT 2"
nyc_crime = read_csv("./NYPD_Complaint_Data_Current_YTD.csv") %>%
clean_names() %>%
select(boro = boro_nm)
crime_number = nyc_crime %>%
group_by(boro) %>%
summarise(n = n())
population = read_csv("./NYC_Population_by_Borough.csv") %>%
mutate(boro = Borough) %>%
select(-Borough)
nyc_crime_population = left_join(population, crime_number, by = "boro") %>%
clean_names() %>%
mutate(population = as.numeric(population)) %>%
mutate(crime_rate = n / population * 100000)
url = "http://www.baruch.cuny.edu/nycdata/income-taxes/med_hhold_income.htm"
household_income = read_html(url)
income_by_region = (household_income %>%
html_nodes(css = "table"))[[1]] %>%
html_table() %>%
.[c(8,9,10,11,12), c(3,4)]
as_tibble()
## # A tibble: 0 x 0
income_by_region = income_by_region %>%
clean_names() %>%
mutate(boro = x3, median_income = x4) %>%
select(-x3, -x4)
income = read_csv("./NYC_Income_by_Borough.csv") %>%
clean_names() %>%
mutate(boro = borough) %>%
select(-borough)
crime_income = left_join(income, nyc_crime_population, by = "boro")
crime_income %>%
ggplot(aes(x = income, y = crime_rate, color = income)) + geom_point(alpha = 0.5) + geom_smooth() +
labs(title = "Corelation between family median income and crime rate in each borough",
x = "Income Range",
y = "Crime rate")
In addition, we have a strong interest in finding potential factors that may associated with criminal rate. In this case, we choose household income level. After reading data from the web, data cleaning and data visualization, we are surprized to see from the scatter plot: Both lower-income borough and higher-income borough have an extremely high crime rate. For example, Bronx borough’s family median income is 35176 dollars, associated with a crime rate of 0.029. That is, we expect 29 crime cases among every 1000 people. In contrast, Family income ranged between 60000 dollars to 70000 dollars tends to have the lowerest crime rate. Taking Queens as an example, we expect only 15 crime cases among every 1000 people.
library(tidytext)
crime_words = nyc_crime_2017 %>%
select(-longitude, -latitude) %>%
mutate(ofns_desc = str_to_lower(ofns_desc),
ofns_desc = str_replace(ofns_desc, "[2-3]",""),
ofns_desc = as.character(ofns_desc)) %>%
unnest_tokens(word, ofns_desc)
data(stop_words)
crime_word_tidy =
anti_join(crime_words, stop_words)
crime_word_tidy %>%
count(word, sort = TRUE) %>%
top_n(10) %>%
mutate(word = fct_reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_bar(stat = "identity", fill = "blue", alpha = .6) +
coord_flip()
The graph analyzes top 10 words showing in offense description. The most frequent one is larceny, which appears nearly 100000 times. Other frequent words including related, petit, assault, harrassment, etc. Most of them indicated the type of crime, which is consistent with what we expect.
word_ratios = crime_word_tidy %>%
filter(ofns_type %in% c("VIOLATION" , "FELONY")) %>%
count(word, ofns_type) %>%
group_by(word) %>%
filter(sum(n) >= 5) %>%
ungroup() %>%
spread(ofns_type, n, fill = 0) %>%
mutate(
violation_odds = (VIOLATION + 1) / (sum(VIOLATION) + 1),
felony_odds = (FELONY + 1) / (sum(FELONY) + 1),
log_OR = log(felony_odds / violation_odds)
) %>%
arrange(desc(log_OR))
word_ratios %>%
mutate(pos_log_OR = ifelse(log_OR > 0, "felony_odds >violation_odds" ,"violation_odds > felony_odds")) %>%
group_by(pos_log_OR) %>%
top_n(10, abs(log_OR)) %>%
ungroup() %>%
mutate(word = fct_reorder(word, log_OR)) %>%
ggplot(aes(word, log_OR, fill = pos_log_OR)) +
geom_col() +
coord_flip() +
ylab("log odds ratio (felony_odds/violation_odds)") +
scale_fill_discrete(name = "") +
theme(legend.position = "bottom")
The above chart compares distinct words(that is, words that appear much more frequently in one group than the other) in offense type of violation and felony. We can see that larceny, robbery, burglary,etc., appear more frequently in offense description of felony crime, while harrassment, gambling, loitering appear more frequently in offense description of violation crime. In terms of the results, we can obtain a basic picture of the difference between felony and violation.